To view an example result set, open a web browser to…
https://uic-ric.github.io/examples/amplicon/
If you obtained basic processing results from RIC, you should have the following files.
Open report.html to view the basic processing report.
Raw sequence counts of the taxonomic summaries are provided in two formats.
From the example results page (https://uic-ric.github.io/examples/amplicon/), click the taxa_raw_counts.xlsx file to open it in Excel. The taxa_raw_counts.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. phylum to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the raw sequence counts in each sample.
Prior to generating the relative sequence counts, the data are typically filtered to remove counts from chloroplast or mitochondria. The counts are then expressed as a fraction of the total for each sample, such that the total for each sample should be 1. Relative sequence counts of the taxonomic summaries are provided in two formats.
From the example results page (https://uic-ric.github.io/examples/amplicon/), click the taxa_relative.xlsx file to open it in Excel. The taxa_relative.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. phylum to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the relative sequence counts in each sample.
NOTE: The normalization method used for these data is rather simplistic. For your own analyses you may need a more sophisticated normalization method.
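The fraction-of-total conversion described above can be sketched in base R; the toy count matrix below is made up for illustration and is not taken from the example data:

```r
# Toy raw count matrix: rows are taxa, columns are samples (made-up numbers)
raw <- matrix(c(10, 30, 60,
                5, 20, 75),
              nrow = 3,
              dimnames = list(c("taxonA", "taxonB", "taxonC"),
                              c("sample1", "sample2")))
# Express each count as a fraction of its sample's total
rel <- sweep(raw, 2, colSums(raw), "/")
colSums(rel)  # each sample now totals 1
```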
Prior to these exercises, make sure you have completed the following.
If you have not previously done so, install the biomformat R package. The biomformat package is available via Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomformat", update=F)
From the example results page (https://uic-ric.github.io/examples/amplicon/), click the taxa_table.biom file to download it and save it to the ‘workshop’ folder on your Desktop. Then load the taxa_table.biom file in R.
# load biomformat library. Allows reading of BIOM files
library(biomformat)
# Load BIOM file for the amplicon sequence table
a_biom <- read_biom("http://wd.cri.uic.edu/metagenomics/taxa_table.biom")
# What class of object is a_biom
class(a_biom)
## [1] "biom"
## attr(,"package")
## [1] "biomformat"
# Take a peek at the BIOM file
a_biom
## biom object.
## type: OTU table
## matrix_type: sparse
## 475 rows and 20 columns
# Get data in the BIOM file
a_data <- biom_data(a_biom)
# What class of object is a_data
class(a_data)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
# Let's peek at the data
a_data[1:10, 1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## Sample10_M10_Fec2 Sample11_M12_Fec3 Sample12_M13_Fec1
## asv0000001 7492 5285 8579
## asv0000002 4185 10031 7125
## asv0000003 9262 3841 5786
## asv0000004 . . 11
## asv0000005 2941 3794 2917
## asv0000006 4676 2004 2943
## asv0000007 1503 1399 1767
## asv0000008 4013 1795 2855
## asv0000009 35 1652 2608
## asv0000010 1686 609 1176
## Sample13_M14_Fec3 Sample14_M15_Fec1
## asv0000001 5488 8666
## asv0000002 9272 6859
## asv0000003 4807 3415
## asv0000004 . .
## asv0000005 3980 2903
## asv0000006 2459 1728
## asv0000007 3271 607
## asv0000008 2263 1616
## asv0000009 275 2810
## asv0000010 291 399
# Get the observation metadata
obs_md <- observation_metadata(a_biom)
# What type of object is the obs_md
typeof(obs_md)
## [1] "list"
# Take a peek
obs_md[1:7]
## $asv0000001
## taxonomy1 taxonomy2 taxonomy3 taxonomy4
## "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
## taxonomy5
## "Muribaculaceae"
##
## $asv0000002
## taxonomy1 taxonomy2 taxonomy3 taxonomy4
## "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
## taxonomy5
## "Muribaculaceae"
##
## $asv0000003
## taxonomy1 taxonomy2 taxonomy3 taxonomy4
## "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
## taxonomy5
## "Muribaculaceae"
##
## $asv0000004
## taxonomy1 taxonomy2 taxonomy3 taxonomy4
## "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
## taxonomy5
## "Tannerellaceae"
##
## $asv0000005
## taxonomy1 taxonomy2 taxonomy3 taxonomy4
## "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
## taxonomy5
## "Muribaculaceae"
##
## $asv0000006
## taxonomy1 taxonomy2 taxonomy3 taxonomy4
## "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
## taxonomy5
## "Muribaculaceae"
##
## $asv0000007
## taxonomy1 taxonomy2
## "Bacteria" "Firmicutes"
## taxonomy3 taxonomy4
## "Clostridia" "Lachnospirales"
## taxonomy5 taxonomy6
## "Lachnospiraceae" "Lachnospiraceae NK4A136 group"
The observation metadata is provided as a list of named vectors. We will convert this to a tibble (a data.frame-like object). With a tibble it is possible to store a list in a column; we can then use the unnest_longer function from the tidyr package to convert the list column into separate entries in the tibble.
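As a minimal self-contained illustration of this list-column approach, with a toy named list (not the workshop data):

```r
library(tibble)
library(tidyr)

# Toy named list resembling the observation metadata structure
md <- list(asv1 = c(taxonomy1 = "Bacteria", taxonomy2 = "Firmicutes"),
           asv2 = c(taxonomy1 = "Bacteria"))

# Store the list in a tibble column, then expand with unnest_longer:
# vector values go to "taxon", vector names go to "level"
tbl <- unnest_longer(tibble(obs_id = names(md), md = md),
                     md, values_to = "taxon", indices_to = "level")
tbl  # one row per (obs_id, level) pair: 3 rows here
```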
Load the dplyr and tidyr packages and create the initial tibble.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
obs_md_tbl <- tibble(obs_id=names(obs_md),
obs_md=obs_md) %>%
unnest_longer(obs_md, values_to = "taxon", indices_to = "level")
# Take a peek at the output
obs_md_tbl
Keep only the taxonomy levels and remove entries with NA for the name.
taxa_tbl <- subset(obs_md_tbl, grepl("^taxonomy[0-9]", level) & (! is.na(taxon)))
taxa_tbl
Next, use the pivot_wider function in the tidyr package to convert to wide format. In any instance where a name is missing for that level we will use the values_fill parameter to have pivot_wider fill with the value “Other”.
taxa_tbl_w <- pivot_wider(taxa_tbl,
names_from="level",
values_from = "taxon",
values_fill = "Other")
taxa_tbl_w
The unite function from the tidyr package will allow us to quickly do that on a range of columns in our tibble.
obs_names <- taxa_tbl_w %>% unite("L6", taxonomy1:taxonomy6, sep=";")
obs_names
Next, merge the obs_names and the BIOM data. Granted, we could use the a_data object created previously.
L6_data <- merge(obs_names[, c("obs_id", "L6")],
as.data.frame(as.matrix(biom_data(a_biom))),
by.x=1, by.y=0)
# Take a peek
head(L6_data)[,1:10]
Drop the first column (obs_id), then group by the L6 column and generate the sum for each sample, i.e. all other columns.
L6_sum <- L6_data[,-1] %>% group_by(L6) %>% summarise(across(everything(), sum))
L6_sum
To view an example result set, open a web browser to…
https://uic-ric.github.io/examples/shotgun_metagenomics/
If you obtained basic processing results from RIC, you should have the following files.
Open report.html to view the basic processing report.
Raw sequence counts of the taxonomic summaries are provided in two formats.
From the example results page (https://uic-ric.github.io/examples/shotgun_metagenomics/), click the taxa_raw_counts.xlsx file to open it in Excel. The taxa_raw_counts.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. kingdom to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the raw sequence counts in each sample.
Raw sequence counts of the functional gene orthologs (KOs) and functional gene categories are provided both as all detected functional gene orthologs (funl_raw_counts) and filtered to remove reads as defined by the taxonomic filter in the report (funl_filtered_raw_counts).
For the normalized data the following steps are performed
Relative sequence counts of the taxonomic summaries are provided in two formats.
From the example results page (https://uic-ric.github.io/examples/shotgun_metagenomics/), click the taxa_normalized_bacteria.xlsx file to open it in Excel. The taxa_normalized_bacteria.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. kingdom to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the normalized sequence counts (CPM) in each sample.
NOTE: The normalization method used for these data is rather simplistic. For your own analyses you may need a more sophisticated normalization method.
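A counts-per-million (CPM) style normalization, as reported in the spreadsheet, can be sketched as follows; the numbers are invented for illustration:

```r
# Toy raw counts: rows are taxa, columns are samples (made-up numbers)
raw <- matrix(c(100, 400, 500,
                250, 250, 500),
              nrow = 3,
              dimnames = list(paste0("taxon", 1:3), c("s1", "s2")))
# Counts per million: scale each sample to a common total of 1e6
cpm <- sweep(raw, 2, colSums(raw), "/") * 1e6
colSums(cpm)  # each sample now totals 1,000,000
```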
To view an example result set, open a web browser to…
https://uic-ric.github.io/examples/shotgun_metagenomics_assembly/
If you obtained basic processing results from RIC, you should have the following files.
Open report.html to view the basic processing report.
Raw sequence counts for the metagenomic assembly are provided as contig_counts.txt; for each contig, the number of mapped reads is displayed.
From the example results page (https://uic-ric.github.io/examples/shotgun_metagenomics_assembly/), click the contig_counts.txt file to open it. The file is tab-delimited and can be imported into Excel. Likewise, click the contig_data.txt file to open it.
Annotation of contig features (ORFs) is available in the contig_annotations.gff file contained in the annotation_results.zip ZIP file.
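A tab-delimited file such as contig_counts.txt can also be read directly into R with read.delim; the two-column layout below is an assumption for illustration, written to a temporary file as a stand-in:

```r
# Write a small tab-delimited stand-in for contig_counts.txt to a temp file
tmp <- tempfile(fileext = ".txt")
writeLines(c("contig\tmapped_reads",
             "contig_1\t1523",
             "contig_2\t87"),
           tmp)
# Read it back, using the first column (contig IDs) as row names
contig_counts <- read.delim(tmp, row.names = 1)
contig_counts["contig_1", "mapped_reads"]  # 1523
```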
For these exercises, we will be downloading files from the Internet into the workshop folder on your Desktop, then loading the files into R. Prior to these exercises, make sure you have completed the following. If you have already created an RStudio project in the morning session, you do NOT need to complete these steps.
If you have not previously done so, install the biomformat R package. The biomformat package is available via Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomformat", update=F)
# load biomformat library. Allows reading of BIOM files
library(biomformat)
# Load BIOM file for amplicon sequencing genus level summary
L6_biom <- read_biom("http://wd.cri.uic.edu/metagenomics/taxa_table_L6.biom")
# Get data in BIOM file
L6_data <- biom_data(L6_biom)
# This is a taxonomic summary, so check the observation/feature names
row.names(L6_data)[1:10]
## [1] "Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Muribaculaceae;Other"
## [2] "Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Tannerellaceae;Other"
## [3] "Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnospiraceae NK4A136 group"
## [4] "Bacteria;Firmicutes;Bacilli;RF39;Other;Other"
## [5] "Bacteria;Firmicutes;Clostridia;Oscillospirales;[Eubacterium] coprostanoligenes group;Other"
## [6] "Bacteria;Firmicutes;Clostridia;Oscillospirales;Oscillospiraceae;Colidextribacter"
## [7] "Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Other"
## [8] "Bacteria;Firmicutes;Clostridia;Oscillospirales;Oscillospiraceae;Intestinimonas"
## [9] "Bacteria;Firmicutes;Clostridia;Oscillospirales;Oscillospiraceae;Other"
## [10] "Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;[Eubacterium] xylanophilum group"
# Determine counts for all samples.
sample_counts <- apply(L6_data, 2, sum)
# View sample counts data
summary(sample_counts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31 45740 50428 49722 56674 79610
sample_counts[order(sample_counts)]
## Sample21_M25_Fec2 Sample22_M26_Fec1 Sample26_M33_Fec2 Sample23_M29_Fec3
## 31 32343 39237 39305
## Sample20_M24_Fec3 Sample27_M34_Fec3 Sample15_M16_Fec2 Sample29_M36_Fec3
## 45393 45856 46029 48789
## Sample19_M23_Fec2 Sample16_M18_Fec3 Sample28_M35_Fec3 Sample18_M20_Fec3
## 49364 49988 50868 53641
## Sample14_M15_Fec1 Sample11_M12_Fec3 Sample13_M14_Fec3 Sample12_M13_Fec1
## 55243 55259 55619 59838
## Sample24_M30_Fec3 Sample25_M32_Fec2 Sample10_M10_Fec2 Sample17_M19_Fec2
## 60474 62840 64718 79610
hist(sample_counts)
Note: If you were not able to complete the previous exercise, execute the following code to load example data.
load(url("http://wd.cri.uic.edu/metagenomics/metagenomics.RData"))
# Filter to remove chloroplast and mitochondrial data. First see what the names are...
grep("Chloroplast", row.names(L6_data), value = T)
## character(0)
grep("Mitochondria", row.names(L6_data), value = T)
## [1] "Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Mitochondria;Other"
L6_data_clean <- L6_data[! grepl("Chloroplast", row.names(L6_data)), ]
L6_data_clean <- L6_data_clean[! grepl("Mitochondria", row.names(L6_data_clean)), ]
# Recompute counts for all samples.
clean_sample_counts <- apply(L6_data_clean, 2, sum)
# View sample counts data
summary(clean_sample_counts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31 45740 50428 49722 56674 79610
clean_sample_counts[order(clean_sample_counts)]
## Sample21_M25_Fec2 Sample22_M26_Fec1 Sample26_M33_Fec2 Sample23_M29_Fec3
## 31 32343 39237 39305
## Sample20_M24_Fec3 Sample27_M34_Fec3 Sample15_M16_Fec2 Sample29_M36_Fec3
## 45393 45856 46029 48789
## Sample19_M23_Fec2 Sample16_M18_Fec3 Sample28_M35_Fec3 Sample18_M20_Fec3
## 49362 49988 50868 53641
## Sample14_M15_Fec1 Sample11_M12_Fec3 Sample13_M14_Fec3 Sample12_M13_Fec1
## 55243 55259 55619 59838
## Sample24_M30_Fec3 Sample25_M32_Fec2 Sample10_M10_Fec2 Sample17_M19_Fec2
## 60474 62840 64718 79610
hist(clean_sample_counts)
# Show difference
clean_sample_counts - sample_counts
## Sample10_M10_Fec2 Sample11_M12_Fec3 Sample12_M13_Fec1 Sample13_M14_Fec3
## 0 0 0 0
## Sample14_M15_Fec1 Sample15_M16_Fec2 Sample16_M18_Fec3 Sample17_M19_Fec2
## 0 0 0 0
## Sample18_M20_Fec3 Sample19_M23_Fec2 Sample20_M24_Fec3 Sample21_M25_Fec2
## 0 -2 0 0
## Sample22_M26_Fec1 Sample23_M29_Fec3 Sample24_M30_Fec3 Sample25_M32_Fec2
## 0 0 0 0
## Sample26_M33_Fec2 Sample27_M34_Fec3 Sample28_M35_Fec3 Sample29_M36_Fec3
## 0 0 0 0
We will need to install the vegan package, which provides a function to rarefy/subsample a table. If you are asked for a CRAN mirror site, please select one close to your location, e.g. “USA (IN)”.
install.packages("vegan")
# load the package
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
From the previous steps we saw that all of our samples have at least 30,000 counts. So let’s perform a rarefaction at that depth.
# Perform a rarefaction at 30k
L6_rare <- t(rrarefy(t(as.matrix(L6_data_clean)), 30000))
## Warning in rrarefy(t(as.matrix(L6_data_clean)), 30000): some row sums < 'sample'
## and are not rarefied
summary(apply(L6_rare, 2, sum))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31 30000 30000 28502 30000 30000
ncol(L6_rare)
## [1] 20
# Remove samples that were not rarefied
L6_rare <- L6_rare[, apply(L6_rare, 2, sum) == 30000]
summary(apply(L6_rare, 2, sum))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30000 30000 30000 30000 30000 30000
ncol(L6_rare)
## [1] 19
# Filter to remove sample with low sequence counts
L6_data_clean <- L6_data_clean[, sample_counts > 5000]
# Generate relative sequence abundance
L6_relative <- apply(L6_data_clean, 2, function(x) { x / sum(x) })
summary(apply(L6_relative, 2, sum))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
# First generate a data frame with the total counts for each taxa and the density
plotdata <- data.frame(taxon=row.names(L6_data_clean),
counts=rowSums(as.matrix(L6_data_clean)),
density=apply(L6_data_clean, 1, function (x) { length(which(x > 0)) }))
# Sort from largest to smallest.
plotdata <- plotdata[ order(plotdata$counts, decreasing = T), ]
# Get a color for each of the possible densities (number of samples)
density_pal <- rev(terrain.colors(ncol(L6_data_clean)))
# Set a color for each taxon based on its density.
plotdata$density_col <- factor(plotdata$density,
levels=seq(1, ncol(L6_data_clean)),
labels=density_pal)
# Plot the taxa counts and color by density
plot(plotdata$counts,
col=as.character(plotdata$density_col),
pch=16, log="y")
legend('topright', legend = seq(1, ncol(L6_data_clean)),
col = levels(plotdata$density_col), pch = 16)
plotdata$retained <- plotdata$counts >= 500 &
plotdata$density >= floor(ncol(L6_data_clean) / 2)
# Plot the taxa counts and color by filter state
plot(plotdata$counts,
col=ifelse(plotdata$retained, "blue", "red"),
pch=16, log="y")
legend('topright', legend = c("retained", "filtered"),
col = c("blue", "red"), pch = 16)
# We can also check the fraction of sequence data that will be retained
sum(plotdata$counts[ plotdata$retained ]) / sum(plotdata$counts)
## [1] 0.9957935
# Get list of taxa that satisfy the filter
filtered_taxa <- plotdata$taxon[ plotdata$retained ]
# Filter the relative sequence table
L6_relative_filtered <- L6_relative[filtered_taxa, ]
# Compare the number of rows.
nrow(L6_relative)
## [1] 76
nrow(L6_relative_filtered)
## [1] 40
# Compare the filtered totals
summary(apply(L6_relative_filtered, 2, sum))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9867 0.9948 0.9970 0.9956 0.9977 0.9989
# Get top 10 taxa
L6_top10 <- L6_relative_filtered[order(apply(L6_relative_filtered, 1, sum),
decreasing=T)[1:10], ]
# stacked barplot of the top taxa
barplot(as.matrix(L6_top10), legend.text=TRUE)
# Clean up names (Optional)
# Strip off any "Other" names and split by semicolon
clean_names <- strsplit(gsub(";Other", "", row.names(L6_top10)), ";")
# Set prefixes for the different taxonomic levels
level_labels <- c("kingdom", "phylum", "class", "order", "family", "genus")
# Use sapply to generate a name using the prefix and the last valid name in the taxonomic name
clean_names <- sapply(clean_names, function (x) { paste(level_labels[length(x)], x[length(x)])})
# Set the names to the matrix
row.names(L6_top10) <- clean_names
# add better colors and rotate the sample labels
barplot(as.matrix(L6_top10), legend.text=T, col=rainbow(10), las=2)
# put the key outside the plot area
par(mar=c(15,5,5,15)) # put a bigger margin on the right and bottom
barplot(as.matrix(L6_top10), legend.text=T, col=rainbow(10), las=2,
args.legend = list(x="right", inset=c(-0.7, 0)))
# note: call dev.off() to reset plot parameters
dev.off()
## null device
## 1
NOTE: If you had not previously installed ComplexHeatmap perform the following to install this package via Bioconductor.
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap", update=F)
# Generate heatmap
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.24.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
##
##
## The new InteractiveComplexHeatmap package can directly export static
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
Heatmap(L6_top10, name="Rel. seq. abund.")
L6_top10_zscore <- t(scale(t(L6_top10)))
Heatmap(L6_top10_zscore, name="Z-score")
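The z-score step uses the t(scale(t(x))) idiom, which standardizes each row (taxon) rather than each column. A quick self-contained check on random toy data:

```r
# Toy matrix: 4 taxa (rows) x 6 samples (columns)
x <- matrix(rnorm(24, mean = 5, sd = 2), nrow = 4)
# scale() works column-wise, so transpose, scale, and transpose back
z <- t(scale(t(x)))
range(rowMeans(z))  # effectively 0 for every row
apply(z, 1, sd)     # exactly 1 for every row
```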
NOTE: If you had not previously installed edgeR perform the following to install this package via Bioconductor.
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("edgeR", update = F)
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
# Only retain samples with at least 5000 sequence counts
L6_filtered <- L6_data_clean[ , colSums(as.matrix(L6_data_clean)) >= 5000]
sel_samples <- intersect(row.names(mapping), colnames(L6_filtered))
mapping_subset <- mapping[sel_samples, , drop=F]
L6_subset <- L6_filtered[, sel_samples]
sample_sizes <- apply(L6_subset, 2, sum)
total_sum <- sum(sample_sizes)
L6_data_filtered <- L6_subset[rowSums(as.matrix(L6_subset)) >= 500 &
apply(L6_subset, 1, function (x) { length(which(x > 0)) })
>= floor(ncol(L6_data_clean) / 2), ]
Load edgeR and perform the initial setup. We will include the original sample sizes.
library(edgeR)
## Loading required package: limma
L6_dgelist <- DGEList(counts=L6_data_filtered,
lib.size=sample_sizes,
group=mapping_subset$Group)
L6_dgelist <- calcNormFactors(L6_dgelist, method="TMM")
# Estimate dispersions
L6_dgelist <- estimateDisp(L6_dgelist)
## Using classic mode.
L6_test <- exactTest(L6_dgelist)
# run FDR correction and add to table
QValue <- p.adjust(L6_test$table$PValue, method="BH")
L6_diff <- cbind(L6_test$table,QValue)
# Sort by increasing p value
L6_diff <- L6_diff[order(L6_diff$PValue, decreasing = F), ]
# Peek at the results
head(L6_diff)
# Write results to a file.
write.table(L6_diff, file="diff_analysis.txt", sep="\t", quote=F, col.names = NA)
# Generate the normalized counts and store as an object in R
L6_cpm <- cpm(L6_dgelist)
# Write results to a file.
write.table(L6_cpm, file="norm.txt", sep="\t", quote=F, col.names = NA)
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
sel_samples <- intersect(row.names(mapping), colnames(L6_relative_filtered))
mapping_subset <- mapping[sel_samples, , drop=F]
L6_subset <- L6_relative_filtered[, sel_samples]
L6_kw <- data.frame(matrix(nrow=nrow(L6_subset),
ncol=1 + length(unique(mapping_subset[,1])) ))
groups <- unique(mapping_subset[,1])
colnames(L6_kw) <- c(groups, "PValue")
row.names(L6_kw) <- row.names(L6_subset)
for ( t in 1:nrow(L6_subset) ) {
for ( g in 1:length(groups) ) {
L6_kw[t,g] <- mean(L6_subset[t, mapping_subset[,1] == groups[g]])
}
kw_res <- kruskal.test(L6_subset[t, ], mapping_subset[,1])
L6_kw$PValue[t] <- kw_res$p.value
}
L6_kw$QValue <- p.adjust(L6_kw$PValue, method="BH")
L6_kw <- L6_kw[ order(L6_kw$PValue, decreasing = F), ]
head(L6_kw)
write.table(L6_kw, file="group_signif.txt", sep="\t", quote=F, col.names=NA)
NOTE: If you have not previously installed the vegan package, install it via CRAN. If you are asked for a CRAN mirror site, please select one close to your location, e.g. “USA (IN)”.
install.packages("vegan")
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
library(vegan)
The merge command in the next step will do that for us.
alpha.div <- data.frame(H=diversity(L6_rare, index="shannon", MARGIN=2))
Merge with the metadata (mapping table).
L6_alpha <- merge(mapping, alpha.div, by.x=0, by.y=0)
boxplot(H ~ Group, data=L6_alpha, main="Alpha diversity",
xlab="Group", ylab="Shannon index (H)")
kruskal.test(H ~ Group, L6_alpha)
##
## Kruskal-Wallis rank sum test
##
## data: H by Group
## Kruskal-Wallis chi-squared = 8.5714, df = 1, p-value = 0.003415
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
L6_cpm <- read.delim("https://wd.cri.uic.edu/metagenomics/norm.txt",
row.names=1, check.names=F)
library(vegan)
mapping <- mapping[colnames(L6_cpm), , drop=F]
beta_div <- vegdist(t(sqrt(L6_cpm)), method="bray")
adonis2(beta_div ~ Group, data=mapping)
anosim(beta_div, mapping[labels(beta_div), "Group"])
##
## Call:
## anosim(x = beta_div, grouping = mapping[labels(beta_div), "Group"])
## Dissimilarity: bray
##
## ANOSIM statistic R: 0.7605
## Significance: 0.002
##
## Permutation: free
## Number of permutations: 999
L6_nmds <- metaMDS(beta_div)
## Run 0 stress 0.06242569
## Run 1 stress 0.06242581
## ... Procrustes: rmse 0.0001126714 max resid 0.0002763492
## ... Similar to previous best
## Run 2 stress 0.06242576
## ... Procrustes: rmse 7.94217e-05 max resid 0.0001945191
## ... Similar to previous best
## Run 3 stress 0.1312469
## Run 4 stress 0.06242568
## ... New best solution
## ... Procrustes: rmse 6.62879e-05 max resid 0.0001657766
## ... Similar to previous best
## Run 5 stress 0.1150082
## Run 6 stress 0.06242572
## ... Procrustes: rmse 0.0001342272 max resid 0.0003267098
## ... Similar to previous best
## Run 7 stress 0.06242576
## ... Procrustes: rmse 0.0001356465 max resid 0.0003296287
## ... Similar to previous best
## Run 8 stress 0.06242573
## ... Procrustes: rmse 0.0001545498 max resid 0.0003798337
## ... Similar to previous best
## Run 9 stress 0.06242582
## ... Procrustes: rmse 0.0002411579 max resid 0.0005972389
## ... Similar to previous best
## Run 10 stress 0.06242571
## ... Procrustes: rmse 9.495894e-05 max resid 0.0002302727
## ... Similar to previous best
## Run 11 stress 0.06242568
## ... Procrustes: rmse 3.583144e-05 max resid 8.748141e-05
## ... Similar to previous best
## Run 12 stress 0.0624258
## ... Procrustes: rmse 0.0002258467 max resid 0.0005580313
## ... Similar to previous best
## Run 13 stress 0.0624257
## ... Procrustes: rmse 7.941642e-05 max resid 0.0001958895
## ... Similar to previous best
## Run 14 stress 0.06242576
## ... Procrustes: rmse 0.0001920973 max resid 0.0004720616
## ... Similar to previous best
## Run 15 stress 0.06242573
## ... Procrustes: rmse 0.0001085607 max resid 0.0002642562
## ... Similar to previous best
## Run 16 stress 0.06242572
## ... Procrustes: rmse 0.0001084033 max resid 0.0002664174
## ... Similar to previous best
## Run 17 stress 0.0624257
## ... Procrustes: rmse 9.684037e-05 max resid 0.0002545257
## ... Similar to previous best
## Run 18 stress 0.06242586
## ... Procrustes: rmse 0.0002762212 max resid 0.0006849466
## ... Similar to previous best
## Run 19 stress 0.06242575
## ... Procrustes: rmse 0.0001717 max resid 0.0004226907
## ... Similar to previous best
## Run 20 stress 0.06242568
## ... Procrustes: rmse 3.011759e-05 max resid 7.92814e-05
## ... Similar to previous best
## *** Solution reached
# Get the groups in order of the items in the NMDS plot
colors <- factor(mapping[row.names(L6_nmds$points), "Group"])
group_names <- levels(colors)
# Then change the group names to colors
levels(colors) <- rainbow(length(levels(colors)))
plot(L6_nmds$points, col=as.character(colors))
plot(L6_nmds$points, col=as.character(colors), pch=16)
# add a legend
legend('topright', legend = group_names,
col = levels(colors), pch = 16)
stressplot(L6_nmds)
The following are a set of additional examples that you can try on your own.
This is an example of performing a pathway enrichment analysis of differential results for a functional gene (KEGG ortholog) table. This example assumes that you have already performed the differential analysis, e.g. edgeR or ANCOM, and have the results of the differential analysis saved to a file or currently loaded in R.
This analysis could be performed on results from either shotgun metagenomics data or a PICRUSt inference from 16S data.
For this example of pathway enrichment of functional gene data you will need the following…
We will use KEGGREST to fetch this information from KEGG. The following steps are run in R.
Install the KEGGREST package, if you have not done so already.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("KEGGREST", update=F)
NOTE: This example assumes that the results from the differential analysis (input file) have the KEGG orthology ID (KO) in the first column and a column named QValue with the FDR-corrected p value from the differential analysis. If your results differ, then you will need to adjust how you generate the lists all_kos and diff_kos.
funl_results <- read.delim("https://wd.cri.uic.edu/metagenomics/diff_funl.txt", row.names=1)
# The tested KOs are in the row names
all_kos <- row.names(funl_results)
# The differential KOs are those with a QValue of at most 0.05.
diff_kos <- row.names(funl_results)[funl_results$QValue <= 0.05]
length(diff_kos)
## [1] 7439
length(all_kos)
## [1] 10624
library(KEGGREST)
# Store the number of KOs in a variable for ease of use
ko_len <- length(all_kos)
# This data frame will have the pathway mapping information
pathway_map <- data.frame(KO=character(), pathway=character())
# The KEGG API only lets one fetch 10 records at a time.
breaks <- seq(1,ko_len, by=10)
for (b in breaks ) {
end <- min(b + 9, ko_len)
these_kos <- all_kos[b:end]
ko_data <- keggLink("pathway", these_kos)
# Only retain the ko pathways. KEGG also returns map* records that are the same.
ko_data <- ko_data[grep("^path:ko", ko_data)]
# Transform to data frame for easier selection
ko_data <- data.frame(KO=sub("^ko:", "", names(ko_data)), pathway=ko_data)
# Add to the existing pathway mapping
pathway_map <- rbind(pathway_map, ko_data)
# Sleep for a tenth of a second to not overwhelm KEGG servers
Sys.sleep(0.1)
}
dim(pathway_map)
## [1] 21318     2
head(pathway_map)
signif_pathways <- subset(pathway_map, KO %in% diff_kos)
# Get the counts of significant KOs for each of the pathways.
# You will likely want to exclude pathways with only one or a few significant KOs
signif_pathways_counts <- table(signif_pathways$pathway)
head(signif_pathways_counts)
##
## path:ko00010 path:ko00020 path:ko00030 path:ko00040 path:ko00051 path:ko00052
## 30 17 14 12 26 20
# Filter to only analyze pathways with more than three significant KOs.
# Based on your results you may want to select a different threshold
signif_pathways_counts <- signif_pathways_counts[ signif_pathways_counts > 3 ]
head(signif_pathways_counts)
##
## path:ko00010 path:ko00020 path:ko00030 path:ko00040 path:ko00051 path:ko00052
## 30 17 14 12 26 20
For this example, we will only use the first pathway from the filtered results. For your analyses, you would likely want to set up a loop to iterate over all of the pathways in the filtered list.
# Get one of the pathways.
# For this example, we are skipping over all of the pathways with more than 50 entries.
# These tend to be very large groups and may not be that meaningful as a pathway
filtered_pathways_counts <- sort(
signif_pathways_counts[ signif_pathways_counts < 50 ], decreasing = T)
this_pathway <- names(filtered_pathways_counts)[1]
# Print the name of the pathway
this_pathway
## [1] "path:ko00230"
# Get a list of all the KOs in the pathway
kos_in_pathway <- pathway_map$KO[ pathway_map$pathway == this_pathway ]
# next compute the different parts for fisher's exact test
# size of intersection. Number of differential KOs in this pathway
intersection <- length(intersect(diff_kos,kos_in_pathway))
# number of non-differential pathway KOs
other_pathway <- length(kos_in_pathway) - intersection
# number of differential KOs that are not in the pathway
other_diff <- length(diff_kos) - intersection
# all other genes in genome
other_kos <- length(all_kos) - intersection - other_pathway - other_diff
# format into contingency matrix
assoc <- matrix(c(intersection, other_pathway, other_diff, other_kos), nrow=2)
# Add column and row names to make a nice table display.
# NOTE: This is not needed for the analysis
colnames(assoc) <- c("In pathway", "Other")
row.names(assoc) <- c("Differential", "No difference")
assoc
## In pathway Other
## Differential 36 7403
## No difference 117 3068
# run fisher's exact test
fet <- fisher.test(assoc)
fet
##
## Fisher's Exact Test for Count Data
##
## data: assoc
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.08500913 0.18720334
## sample estimates:
## odds ratio
## 0.1275471
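As noted above, the single-pathway test can be wrapped in a function so it can be applied over every filtered pathway. The helper function pathway_fet below and the toy KO sets are invented for illustration; in the real analysis the inputs would come from pathway_map, diff_kos, and all_kos:

```r
# Fisher's exact test for enrichment of differential KOs in one pathway.
# Inputs are character vectors of KO identifiers.
pathway_fet <- function(kos_in_pathway, diff_kos, all_kos) {
  intersection  <- length(intersect(diff_kos, kos_in_pathway))
  other_pathway <- length(kos_in_pathway) - intersection
  other_diff    <- length(diff_kos) - intersection
  other_kos     <- length(all_kos) - intersection - other_pathway - other_diff
  assoc <- matrix(c(intersection, other_pathway, other_diff, other_kos), nrow = 2)
  fisher.test(assoc)$p.value
}

# Toy example: 10 KOs total, 4 differential, a 3-KO pathway that is all differential
all_kos <- paste0("K0000", 1:10)
diff_kos <- all_kos[1:4]
kos_in_pathway <- all_kos[1:3]
pathway_fet(kos_in_pathway, diff_kos, all_kos)

# With the objects from this section, one could iterate over the filtered list:
# sapply(names(filtered_pathways_counts), function(p)
#   pathway_fet(pathway_map$KO[pathway_map$pathway == p], diff_kos, all_kos))
```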
NOTE: You only need to do this once.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ANCOMBC", update=F)
ANCOM-BC expects the input data as a SummarizedExperiment or TreeSummarizedExperiment. The base sequence table (taxa_table.biom) can be loaded via the loadFromBiom function in the mia package (installed with the ANCOMBC package).
biom_tse <- loadFromBiom("http://wd.cri.uic.edu/metagenomics/taxa_table.biom")
## Warning in loadFromBiom("http://wd.cri.uic.edu/metagenomics/taxa_table.biom"):
## 'loadFromBiom' is deprecated. Use 'importBIOM' instead.
# Take a quick peek
biom_tse
## class: TreeSummarizedExperiment
## dim: 475 20
## metadata(0):
## assays(1): counts
## rownames(475): asv0000001 asv0000002 ... asv0000474 asv0000475
## rowData names(8): taxonomy1 taxonomy2 ... taxonomy7 taxonomy_unparsed
## colnames(20): Sample10_M10_Fec2 Sample11_M12_Fec3 ... Sample28_M35_Fec3
## Sample29_M36_Fec3
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
names(rowData(biom_tse)) <- c("Kingdom", "Phylum", "Class", "Order",
"Family", "Genus", "Species")
Next, load the sample metadata and add it to the biom_tse object.
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
# Ensure the samples in the mapping data.frame are in the same order as the biom_tse object.
mapping <- mapping[colnames(biom_tse), , drop=F]
# Add to the biom_tse. Will need to convert to DataFrame (a special class for
# SummarizedExperiment objects)
colData(biom_tse) <- DataFrame(mapping)
# Take a final peek at the object.
# 1. We should see that the rowData names are the taxonomic levels
# 2. The colData names should be "Group" (the experimental Group from the mapping file)
biom_tse
## class: TreeSummarizedExperiment
## dim: 475 20
## metadata(0):
## assays(1): counts
## rownames(475): asv0000001 asv0000002 ... asv0000474 asv0000475
## rowData names(8): Kingdom Phylum ... Species NA
## colnames(20): NA Sample11_M12_Fec3 ... NA.4 NA.5
## colData names(1): Group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
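Note in the output above that several colnames have become NA: those samples were not found in the mapping file, so indexing mapping by colnames(biom_tse) returned NA rows for them. A minimal sketch of a check that catches this before assigning colData (the sample names here are hypothetical):

```r
# Hypothetical sample names illustrating the mismatch check
se_samples  <- c("Sample10_M10_Fec2", "Sample11_M12_Fec3", "Sample12_M13_Fec1")
map_samples <- c("Sample11_M12_Fec3", "Sample12_M13_Fec1")
# Samples present in the experiment object but absent from the mapping file
missing <- setdiff(se_samples, map_samples)
if (length(missing) > 0)
  warning("Samples missing from the mapping file: ",
          paste(missing, collapse = ", "))
```

In a real analysis, se_samples would be colnames(biom_tse) and map_samples would be row.names(mapping).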
Run the differential abundance analysis with the ancombc2 function from the ANCOMBC library. We will run the analysis at the Genus level while computing adjusted p values using a Benjamini-Hochberg FDR correction. The fix_formula parameter sets the formula to fit to the data; as we only have a single factor ("Group"), the formula is just that term. The prv_cut parameter sets the fraction of samples in which a taxon must be present to be evaluated. In this instance, we set it to 0.5 (50%) to match the analyses in the Afternoon exercises.
ancom_res <- ancombc2(biom_tse,
                      tax_level="Genus",
                      p_adj_method = "BH",
                      fix_formula = "Group",
                      prv_cut=0.5)
## Checking the input data type ...
## The input data is of type: TreeSummarizedExperiment
## PASS
## Checking the sample metadata ...
## The specified variables in the formula: Group
## The available variables in the sample metadata: Group
## PASS
## Checking other arguments ...
## PASS
## Obtaining initial estimates ...
## Estimating sample-specific biases ...
## Loading required package: foreach
## Loading required package: rngtools
## ANCOM-BC2 primary results ...
## Conducting sensitivity analysis for pseudo-count addition to 0s ...
## For taxa that are significant but do not pass the sensitivity analysis,
## they are marked in the 'passed_ss' column and will be treated as non-significant in the 'diff_robust' column.
## For detailed instructions on performing sensitivity analysis, please refer to the package vignette.
The results are stored in the res element of the ancom_res object.
ancom_diff <- ancom_res$res
The columns of interest carry the suffix GroupTreatment, as we tested the factor "Group", which has two levels ("Control" and "Treatment"). The first level (Control) is equivalent to the intercept, and the computed statistics are relative to that level.
# Get the taxon column and the columns for the GroupTreatment contrast
ancom_diff <- ancom_diff[, c(1, grep("GroupTreatment$", colnames(ancom_diff)))]
# Strip the suffix from the column names
colnames(ancom_diff) <- sub("_GroupTreatment$", "", colnames(ancom_diff))
# Take a peek
head(ancom_diff)
write.table(ancom_diff, file="ancom_diff.txt", sep="\t", row.names=F, quote=F)
plot(ancom_diff$lfc, ancom_diff$q,
log="y", type="p", ylim=c(1, min(ancom_diff$q)),
xlab="Log2 Fold Change (Treatment - Control)",
ylab="FDR corrected p Value",
col=ifelse(ancom_diff$diff, "red", "black"))
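The plot above flips the log-scaled y axis so that smaller q values sit higher. An equivalent, perhaps more familiar, volcano-style presentation uses -log10 of the q value; a sketch with toy values (not the real ANCOM-BC2 results):

```r
# Toy log2 fold changes and FDR-corrected p values
lfc <- c(0.2, -1.5, 2.3)
q   <- c(0.5, 0.01, 1e-6)
# Smaller q values now plot higher without reversing the axis
plot(lfc, -log10(q), pch = 19,
     xlab = "Log2 fold change", ylab = "-log10(q)")
abline(h = -log10(0.05), lty = 2)  # q = 0.05 significance threshold
```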
# Get the significant features
plotdata <- subset(ancom_diff, diff)
# Sort by lfc
plotdata <- plotdata[order(plotdata$lfc), ]
# Adjust the margins to increase the left margin
par(mar=c(5, 12, 4, 2))
# Generate the base bar plot and color by down vs. up
base_barplot <- barplot(plotdata$lfc,
names.arg = plotdata$taxon,
col=ifelse(plotdata$lfc < 0, "blue", "red"),
horiz = TRUE,
xlim=range( plotdata$lfc + ( sign(plotdata$lfc) * plotdata$se)),
las=1,
xlab="Log2 fold change")
# Use the arrows function to add the whiskers for the standard error
arrows(y0=base_barplot,
x0=plotdata$lfc + plotdata$se,
x1=plotdata$lfc - plotdata$se,
angle=90,
code=3,
length=0.1)
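The arrows() call works because barplot() returns the coordinates of the bar centers (y coordinates here, since the bars are horizontal), which is what lets the whiskers line up with the bars. A minimal toy illustration of the same pattern:

```r
vals <- c(-1.2, 0.5, 2.0)   # toy log2 fold changes
se   <- c(0.3, 0.2, 0.4)    # toy standard errors
# barplot() returns one center coordinate per bar
centers <- barplot(vals, horiz = TRUE, xlim = c(-2, 3))
# Use those coordinates to draw standard-error whiskers on each bar
arrows(y0 = centers, x0 = vals - se, x1 = vals + se,
       angle = 90, code = 3, length = 0.05)
```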
The following is an example workflow for processing raw FASTQ data using QIIME2.
Import manifest
An import manifest was included with the example data. However, if you need to create an import manifest for your own data, it should have the following columns.
sample-id - Sample ID
forward-absolute-filepath - Absolute path to the forward read (read 1) FASTQ file. Can use $PWD to denote the current directory.
reverse-absolute-filepath - Absolute path to the reverse read (read 2) FASTQ file. Can use $PWD to denote the current directory.
The following is an example manifest file.
## sample-id forward-absolute-filepath reverse-absolute-filepath
## Sample_001_P1 $PWD/Sample-001-P1_S61_L001_R1_001.fastq.gz $PWD/Sample-001-P1_S61_L001_R2_001.fastq.gz
## Sample_002_P2 $PWD/Sample-002-P2_S62_L001_R1_001.fastq.gz $PWD/Sample-002-P2_S62_L001_R2_001.fastq.gz
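For more than a handful of samples, the manifest can be generated programmatically. A sketch in R, assuming file names follow the pattern shown above; the sample-id derivation is illustrative only, so adjust the regular expression to your own naming scheme.

```r
# Hypothetical R1 FASTQ file names following the pattern above
r1 <- c("Sample-001-P1_S61_L001_R1_001.fastq.gz",
        "Sample-002-P2_S62_L001_R1_001.fastq.gz")
r2  <- sub("_R1_", "_R2_", r1)  # matching R2 (reverse read) files
# Derive sample IDs from the file names (illustrative pattern)
ids <- sub("^Sample-([0-9]+)-([A-Za-z0-9]+)_.*", "Sample_\\1_\\2", r1)
manifest <- data.frame(
  "sample-id"                 = ids,
  "forward-absolute-filepath" = file.path("$PWD", r1),
  "reverse-absolute-filepath" = file.path("$PWD", r2),
  check.names = FALSE)
# Tab-delimited, no quoting, as QIIME2 expects
write.table(manifest, "manifest.txt", sep = "\t",
            row.names = FALSE, quote = FALSE)
```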
Processing steps
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest.txt \
--output-path paired-end-demux.qza \
--input-format PairedEndFastqManifestPhred33V2
## Imported manifest.txt as PairedEndFastqManifestPhred33V2 to paired-end-demux.qza
qiime demux summarize --i-data paired-end-demux.qza \
--o-visualization paired-end-demux-summary.qzv
## Saved Visualization to: paired-end-demux-summary.qzv
Either run qiime tools view paired-end-demux-summary.qzv or go to https://view.qiime2.org to view the visualization.
Use cutadapt to trim on quality, adapters, and minimum length, and discard reads missing the adapters. NOTE: you will need to adjust the sequences of the forward (--p-front-f) and reverse (--p-front-r) primers based on the primers used, and adjust --p-minimum-length based on your sequencing parameters, e.g. whether it was 2x150 or 2x300. The following are the parameters for data using the 515F and 806R primers (V4 region) with 2x150 bp sequencing.
qiime cutadapt trim-paired \
--p-front-f ^GTGCCAGCMGCCGCGGTAA \
--p-front-r ^GGACTACHVGGGTWTCTAAT \
--p-error-rate 0.10 \
--p-minimum-length 100 \
--p-discard-untrimmed \
--i-demultiplexed-sequences paired-end-demux.qza \
--o-trimmed-sequences trimmed-reads.qza
## Saved SampleData[PairedEndSequencesWithQuality] to: trimmed-reads.qza
qiime demux summarize --i-data trimmed-reads.qza --o-visualization trimmed-reads-summary.qzv
## Saved Visualization to: trimmed-reads-summary.qzv
Either run qiime tools view trimmed-reads-summary.qzv or go to https://view.qiime2.org to view the visualization.
Use dada2 to merge and denoise the sequence data. NOTE: you may need to adjust these parameters based on your particular data and sequencing setup.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs trimmed-reads.qza \
--p-chimera-method consensus \
--p-hashed-feature-ids \
--p-trunc-len-f 0 --p-trunc-len-r 0 \
--p-trunc-q 15 \
--o-table dada2_table.qza \
--o-representative-sequences dada2_rep_set.qza \
--o-denoising-stats dada2_stats.qza
## Saved FeatureTable[Frequency] to: dada2_table.qza
## Saved FeatureData[Sequence] to: dada2_rep_set.qza
## Saved SampleData[DADA2Stats] to: dada2_stats.qza
qiime tools export --input-path dada2_stats.qza --output-path dada2_stats
## Exported dada2_stats.qza as DADA2StatsDirFmt to directory dada2_stats
less dada2_stats/stats.tsv
## sample-id input filtered percentage of input passed filter denoised merged percentage of input merged non-chimeric percentage of input non-chimeric
## #q2:types numeric numeric numeric numeric numeric numeric numeric numeric
## Sample_001_P1 4846 4104 84.69 4053 3725 76.87 3684 76.02
## Sample_002_P2 4857 4068 83.76 3991 3636 74.86 3594 74
## Sample_003_P3 4878 4098 84.01 4052 3663 75.09 3599 73.78
## Sample_007_C1 4865 4096 84.19 4017 3597 73.94 3515 72.25
## Sample_008_C1 4876 4075 83.57 4015 3602 73.87 3533 72.46
## Sample_009_C3 4862 4030 82.89 3926 3493 71.84 3398 69.89
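The stats table can also be read into R to check per-sample read retention. A sketch using an inline two-sample slice of the table (three of the columns, with values copied from the output above); in practice you would point read.delim at dada2_stats/stats.tsv instead.

```r
# Inline slice of a DADA2 stats.tsv; comment.char = "#" skips the "#q2:types" row
tsv <- "sample-id\tinput\tfiltered\tnon-chimeric
#q2:types\tnumeric\tnumeric\tnumeric
Sample_001_P1\t4846\t4104\t3684
Sample_002_P2\t4857\t4068\t3594"
stats <- read.delim(textConnection(tsv), comment.char = "#",
                    check.names = FALSE)
# Fraction of input reads surviving the whole pipeline
retention <- stats[["non-chimeric"]] / stats[["input"]]
round(100 * retention, 2)
```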
qiime metadata tabulate --m-input-file dada2_stats.qza \
--o-visualization dada2_stats_table.qzv
## Saved Visualization to: dada2_stats_table.qzv
Either run qiime tools view dada2_stats_table.qzv or go to https://view.qiime2.org to view the visualization.
Download the appropriate QIIME2 Silva reference database file from https://docs.qiime2.org/2021.8/data-resources/.
Run the feature classifier
qiime feature-classifier classify-sklearn \
--i-reads dada2_rep_set.qza \
--i-classifier silva-138-99-515-806-nb-classifier.qza \
--o-classification rep_set_assignments.qza
## Saved FeatureData[Taxonomy] to: rep_set_assignments.qza
qiime metadata tabulate \
--m-input-file rep_set_assignments.qza \
--o-visualization assignments_stats.qzv
## Saved Visualization to: assignments_stats.qzv
Either run qiime tools view assignments_stats.qzv or go to https://view.qiime2.org to view the visualization.
Filter the feature table to remove features assigned to chloroplast or mitochondria.
qiime taxa filter-table \
--i-table dada2_table.qza \
--i-taxonomy rep_set_assignments.qza \
--p-exclude c__Chloroplast,f__mitochondria \
--o-filtered-table dada2_table_clean.qza
## Saved FeatureTable[Frequency] to: dada2_table_clean.qza
qiime feature-table summarize \
--i-table dada2_table_clean.qza \
--o-visualization dada2_table_clean_summary.qzv
## Saved Visualization to: dada2_table_clean_summary.qzv
Either run qiime tools view dada2_table_clean_summary.qzv or go to https://view.qiime2.org to view the visualization.
Generate a genus level summary of the clean DADA2 table
qiime taxa collapse \
--i-table dada2_table_clean.qza \
--i-taxonomy rep_set_assignments.qza \
--p-level 6 \
--o-collapsed-table genus_summary_clean.qza
## Saved FeatureTable[Frequency] to: genus_summary_clean.qza
qiime tools export \
--input-path genus_summary_clean.qza \
--output-path genus_summary_clean
## Exported genus_summary_clean.qza as BIOMV210DirFmt to directory genus_summary_clean
# load biomformat library. Allows reading of BIOM files
library(biomformat)
# Load BIOM file for genus level summary
genus_biom <- read_biom("genus_summary_clean/feature-table.biom")
# NOTE! We have noticed some issues with the biomformat library.
# If the previous read command fails with a message that includes the following...
# `unable to translate 'lexical error: invalid char in json text.`
# Then use the following command to load the BIOM file.
genus_biom <- biom(read_hdf5_biom("genus_summary_clean/feature-table.biom"))
# Get data in BIOM file
genus_data <- biom_data(genus_biom)
# This is a taxonomic summary, so check the observation/feature names
row.names(genus_data)[1:10]
# Determine counts for all samples.
sample_counts <- apply(genus_data, 2, sum)
# View sample counts data
summary(sample_counts)
sample_counts[order(sample_counts)]
hist(sample_counts)
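As described for the taxa_relative.xlsx output, relative abundances are simply each sample's counts divided by that sample's total. A sketch on a toy count matrix with hypothetical values; the same operation applies to genus_data.

```r
# Toy genus-by-sample count matrix (hypothetical values)
counts <- matrix(c(10, 30, 60,
                   5, 45, 50),
                 ncol = 2,
                 dimnames = list(c("GenusA", "GenusB", "GenusC"),
                                 c("Sample1", "Sample2")))
# Divide each column (sample) by its total so every sample sums to 1
rel <- sweep(counts, 2, colSums(counts), "/")
colSums(rel)  # each sample should total 1
```

Remember the NOTE above: this per-sample total normalization is simplistic, and your own analyses may require a more sophisticated method.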